Conversation

Contributor

@vaibhav45sktech vaibhav45sktech commented Jan 3, 2026

This PR updates the CLI to correctly handle custom file extensions and multiple content variants. Previously, there was no CLI syntax for supplying this metadata, forcing users to rely on complex SPARQL queries to exclude unwanted variants. I have implemented a new pipe-separated syntax (e.g., URL|key=value|.ext) that lets users pass metadata directly alongside the artifact URL. The implementation maintains backward compatibility with the existing underscore-separated format (key1=value1_key2=value2) while adding support for the new, more flexible pipe syntax.
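
For illustration, a minimal sketch of how a spec in the new syntax would be interpreted by the parser added in this PR (parse_distribution_str in databusclient/cli.py); the URL and variant values are made up for the example:

from databusclient.cli import parse_distribution_str

# One URL, two content variants, a format extension and a compression
# extension, separated by pipes.
spec = "https://example.org/data/ontology.nt|lang=en|type=parsed|.nt|.gz"

parsed = parse_distribution_str(spec)
# Expected shape of the result (field names as described in this PR):
# {
#     "url": "https://example.org/data/ontology.nt",
#     "variants": {"lang": "en", "type": "parsed"},
#     "formatExtension": "nt",
#     "compression": "gz",
# }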

Summary by CodeRabbit

  • New Features

    • Support for supplying distributions as pre-parsed structured data or legacy text; parser extracts metadata, variants, format and compression.
    • Classic CLI deploy flow now parses distribution specs before submission.
    • Improved backend queries to better aggregate and expose content variants.
  • Tests

    • Added tests covering distribution parsing, variant handling, and dataset creation integration.

issue #32

coderabbitai bot commented Jan 3, 2026

📝 Walkthrough

Walkthrough

Adds support for pre-parsed distribution dictionaries to the deploy API, updates the CLI to parse pipe-separated distribution strings into dicts, and introduces a SPARQL query constant with a helper to parse aggregated content-variant strings.

Changes

Cohort / File(s) Summary
Core API - Distribution Format Support
databusclient/api/deploy.py
Added _get_file_info_from_dict(dist_dict) to extract file metadata from parsed distribution dicts. Updated create_dataset(...) signatures to accept Union[List[str], List[Dict]] and added branching logic to handle both dict and string distribution inputs while preserving multi-file validation.
Queries Module
databusclient/api/queries.py
Added ONTOLOGIES_QUERY SPARQL constant aggregating content variants via GROUP_CONCAT and parse_content_variants_string(variants_str: str) -> dict to parse that aggregated string into a dict (supports key=value and standalone values). A sketch of this parsing follows the table.
CLI Distribution Parser
databusclient/cli.py
Added parse_distribution_str(dist_str: str) that parses pipe-separated distribution specs into dicts (url, variants, formatExtension, compression) and wired the classic deploy flow to pass parsed dicts to api_deploy.create_dataset. Emits warnings for unrecognized modifiers.
Tests — CLI parsing & integration
tests/test_parse_distribution.py
New tests covering parse_distribution_str (URL extraction, variants parsing, format/compression detection, edge cases) and integration tests mocking _load_file_stats to verify _get_file_info_from_dict and create_dataset behavior with parsed distributions.
Tests — Misc
tests/test_deploy.py
Added module-level constant EXAMPLE_URL.
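
For context on the Queries Module row above, a minimal sketch of what a parser like parse_content_variants_string might do; the GROUP_CONCAT separator and the handling of standalone values are assumptions, since only the signature and the key=value / standalone-value behaviour are described here:

from typing import Dict, Optional

def parse_variants_sketch(variants_str: str, sep: str = ",") -> Dict[str, Optional[str]]:
    # Split the aggregated string, keep key=value pairs as entries and record
    # standalone values with a None value (actual handling may differ).
    variants: Dict[str, Optional[str]] = {}
    for token in variants_str.split(sep):
        token = token.strip()
        if not token:
            continue
        if "=" in token:
            key, value = token.split("=", 1)
            variants[key.strip()] = value.strip()
        else:
            variants[token] = None
    return variants

# e.g. parse_variants_sketch("lang=en,type=parsed,sorted")
# -> {'lang': 'en', 'type': 'parsed', 'sorted': None}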

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Refactor/project structure #39: Extends and modifies the same deploy API surface by adding parsed-distribution support and changing create_dataset signatures.
  • feat Nextcloudclient #29: Related CLI/deploy changes that introduce distribution-from-metadata helpers and adjustments to the deploy flow.

Suggested reviewers

  • Integer-Ctrl
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: enabling CLI support for content variants and custom extensions, which matches the primary objective of the PR.
Description check ✅ Passed The description explains the changes and references issue #32, but is missing key template sections like Type of change, Checklist items, and self-review confirmation.
Docstring Coverage ✅ Passed Docstring coverage is 97.22% which is sufficient. The required threshold is 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/test_deploy.py (1)

19-47: Test coverage is good; consider adding formatExtension assertion in the last test.

The tests comprehensively cover multiple scenarios. However, the final test case (lines 44-46) validates only compression but doesn't verify that formatExtension is None as expected.

🔎 Suggested test completeness improvement
     # Test with compression only (no format extension)
     result = parse_distribution_str("http://example.com/data|.gz")
     assert result["url"] == "http://example.com/data"
     assert result["compression"] == "gz"
+    assert result["formatExtension"] is None
+    assert result["variants"] == {}
databusclient/cli.py (1)

15-57: Fix indentation and consider extracting the compression list.

  1. Line 50: The print statement appears to have inconsistent indentation (5 spaces instead of 4). Verify this matches the project's style.

  2. Line 37: The hardcoded compression extensions list is a pragmatic heuristic, but extracting it as a module-level constant would improve maintainability:

🔎 Suggested improvements
+# Common compression file extensions
+COMPRESSION_EXTENSIONS = {'.gz', '.zip', '.br', '.tar', '.zst'}
+
 def parse_distribution_str(dist_str: str):
     """
     Parses a distribution string with format:
     URL|key=value|...|.extension
     
     Returns a dictionary suitable for the deploy API.
     """
     parts = dist_str.split('|')
     url = parts[0].strip()
     
     variants = {}
     format_ext = None
     compression = None
     
     # Iterate over the modifiers (everything after the URL)
     for part in parts[1:]:
         part = part.strip()
         
         # Case 1: Extension (starts with .)
         if part.startswith('.'):
             # purely heuristic: if it looks like compression (gz, zip, br), treat as compression
             # otherwise treat as format extension
-            if part.lower() in ['.gz', '.zip', '.br', '.tar', '.zst']:
+            if part.lower() in COMPRESSION_EXTENSIONS:
                 compression = part.lstrip('.') # remove leading dot for API compatibility if needed
             else:
                 format_ext = part.lstrip('.')
         
         # Case 2: Content Variant (key=value)
         elif '=' in part:
             key, value = part.split('=', 1)
             variants[key.strip()] = value.strip()
             
         # Case 3: Standalone tag (treat as boolean variant or ignore? 
         # For now, we assume it's a value for a default key or warn)
         else:
-             print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+            print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2c3f1c and 138a10c.

📒 Files selected for processing (4)
  • databusclient/api/deploy.py
  • databusclient/api/queries.py
  • databusclient/cli.py
  • tests/test_deploy.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_deploy.py (1)
databusclient/cli.py (1)
  • parse_distribution_str (15-57)
🔇 Additional comments (5)
databusclient/api/queries.py (1)

9-51: LGTM: Well-structured SPARQL query with proper variant handling.

The query correctly uses GROUP_CONCAT with OPTIONAL to handle distributions that may lack content variants, and appropriately filters development versions.

databusclient/api/deploy.py (3)

176-208: LGTM: Correctly extracts metadata from pre-parsed dictionaries.

The function properly handles optional fields and falls back to loading file stats when necessary. The defaults for formatExtension ("file") and compression ("none") align with the legacy inference logic.

Note: The function expects byteSize in the input dict (line 202) but returns content_length, which matches the API entity field naming convention.


304-348: LGTM: Clear documentation of dual-format distributions.

The updated signature and docstring clearly communicate that distributions can be either legacy strings or pre-parsed dictionaries, maintaining backward compatibility.


366-387: LGTM: Clean dual-format handling with proper type discrimination.

The implementation correctly routes dict and string distributions through their respective parsers while maintaining consistent downstream processing.
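
To make that type discrimination concrete, here is a minimal sketch of the dict branch only, reusing the defaults noted above ("file" / "none"); the function name is invented for the sketch, and the legacy string branch is omitted because its parser is not shown in this excerpt:

from typing import Dict, Union

def normalize_distribution(dist: Union[str, Dict]) -> Dict:
    # Pre-parsed dicts are normalized directly; legacy strings would go
    # through the existing string parser, which is not reproduced here.
    if isinstance(dist, dict):
        return {
            "url": dist["url"],
            "variants": dist.get("variants", {}),
            "formatExtension": dist.get("formatExtension") or "file",  # legacy default
            "compression": dist.get("compression") or "none",          # legacy default
        }
    raise NotImplementedError("legacy string specs are handled by the existing parser")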

databusclient/cli.py (1)

130-138: LGTM: Clean integration of distribution parsing into the deploy workflow.

The list comprehension correctly transforms raw distribution strings into structured dictionaries before passing them to the API layer.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
databusclient/cli.py (1)

1-12: Past review comment about unused re import appears resolved.

The previous review mentioned an unused re import at line 4, but the current code shows from typing import List at that line. The issue appears to have been addressed.

🧹 Nitpick comments (2)
databusclient/cli.py (2)

14-14: Add return type annotation.

The function should include a return type annotation for better type safety and documentation.

🔎 Proposed fix
+from typing import List, Dict, Any
+
-def parse_distribution_str(dist_str: str):
+def parse_distribution_str(dist_str: str) -> Dict[str, Any]:

49-49: Use click.echo() instead of print() for consistency.

For CLI applications using Click, prefer click.echo() over print() for better cross-platform compatibility and consistency with the rest of the CLI output.

🔎 Proposed fix
-             print(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
+             click.echo(f"WARNING: Unrecognized modifier '{part}' in distribution. Expected '.ext' or 'key=val'.")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 138a10c and 3b3c28c.

📒 Files selected for processing (3)
  • databusclient/api/queries.py
  • databusclient/cli.py
  • tests/test_deploy.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • databusclient/api/queries.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/cli.py (1)
databusclient/api/deploy.py (1)
  • create_dataset (304-465)
🔇 Additional comments (1)
tests/test_deploy.py (1)

15-16: LGTM!

Good refactoring to define the example URL as a constant. This improves maintainability and follows DRY principles.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/test_parse_distribution.py (4)

92-121: Consider adding a test for case-insensitive compression detection.

The implementation uses .lower() to match compression extensions (as seen in the relevant code snippet), but there's no test verifying that .GZ, .Zip, etc. are correctly detected. This ensures the case-insensitive behavior is maintained.

🔎 Suggested test addition

Add this test to the compression detection section:

def test_compression_case_insensitive(self):
    """Test that compression detection is case-insensitive."""
    result = parse_distribution_str("http://example.com/file|.GZ")
    assert result["compression"] == "gz"
    
    result = parse_distribution_str("http://example.com/file|.Zip")
    assert result["compression"] == "zip"

21-30: Consider adding a test for URLs containing pipe characters.

The current parsing logic splits on | to separate the URL from modifiers. If a URL contains a pipe character (though rare, it's valid in URLs), the parsing would break. Consider adding a test to document this limitation or suggest URL encoding.

🔎 Suggested test addition

Add this test to the URL extraction section:

def test_url_with_pipe_character(self):
    """Test handling of URLs containing pipe characters (edge case)."""
    # URLs can technically contain pipe characters
    # This test documents current behavior (would break parsing)
    # Consider URL encoding pipes as %7C in practice
    url_with_pipe = "http://example.com/data?filter=a%7Cb"  # %7C is URL-encoded pipe
    result = parse_distribution_str(f"{url_with_pipe}|lang=en")
    assert result["url"] == url_with_pipe

63-121: Consider using parametrized tests to reduce duplication.

The format extension tests (lines 63-87) and compression detection tests (lines 92-121) follow a repetitive pattern. Using pytest.mark.parametrize could make these more maintainable and concise.

🔎 Example parametrized test refactor

Replace the individual format extension tests with:

@pytest.mark.parametrize("extension,expected", [
    ("json", "json"),
    ("ttl", "ttl"),
    ("csv", "csv"),
    ("xml", "xml"),
])
def test_format_extension_detection(self, extension, expected):
    """Test format extension detection for various types."""
    result = parse_distribution_str(f"http://example.com/file|.{extension}")
    assert result["formatExtension"] == expected

Similarly for compression tests:

@pytest.mark.parametrize("compression,expected", [
    ("gz", "gz"),
    ("zip", "zip"),
    ("br", "br"),
    ("tar", "tar"),
    ("zst", "zst"),
])
def test_compression_detection(self, compression, expected):
    """Test compression detection for various types."""
    result = parse_distribution_str(f"http://example.com/file|.{compression}")
    assert result["compression"] == expected

126-146: Consider adding a test for multiple extensions of the same type.

The current implementation would overwrite if multiple format extensions (e.g., |.json|.xml) or multiple compression types (e.g., |.gz|.zip) are specified. A test documenting this behavior would help prevent confusion.

🔎 Suggested test addition

Add to the edge cases section:

def test_multiple_format_extensions_last_wins(self):
    """Test that when multiple format extensions are specified, the last one wins."""
    result = parse_distribution_str("http://example.com/file|.json|.xml")
    # Last extension specified should be used
    assert result["formatExtension"] == "xml"

def test_multiple_compression_types_last_wins(self):
    """Test that when multiple compression types are specified, the last one wins."""
    result = parse_distribution_str("http://example.com/file|.gz|.zip")
    # Last compression specified should be used
    assert result["compression"] == "zip"
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b3c28c and e8704ab.

📒 Files selected for processing (1)
  • tests/test_parse_distribution.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_parse_distribution.py (2)
databusclient/cli.py (1)
  • parse_distribution_str (14-56)
databusclient/api/deploy.py (2)
  • create_dataset (304-465)
  • _get_file_info_from_dict (176-208)
🔇 Additional comments (4)
tests/test_parse_distribution.py (4)

1-11: LGTM! Imports are appropriate for the test scope.

The imports correctly include the main function under test (parse_distribution_str) and the deployment API components needed for integration testing. Importing the private function _get_file_info_from_dict is acceptable for integration testing purposes.


151-176: LGTM! Edge cases are well-tested.

The tests properly verify whitespace handling, standalone tag warnings (including both stdout capture and result validation), and the minimal URL-only case. The test at lines 158-166 correctly ensures that unrecognized modifiers trigger a warning and are not added to the variants dict.


181-208: LGTM! Integration tests correctly verify parsed dictionaries work with deploy API.

The mocking strategy is appropriate, using _load_file_stats to avoid actual file operations (a minimal sketch of this setup follows the list below). The test correctly verifies that:

  • Parsed variants, extensions, and compression are properly extracted
  • Default values ("file" for extension, "none" for compression) are applied when not specified
  • The SHA-256 hash length is correct (64 characters)
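
A minimal sketch of that setup, assuming the patch target module path and the create_dataset keyword arguments shown elsewhere in this review; the version ID, license and file values are purely illustrative:

from unittest.mock import patch

from databusclient.api import deploy as api_deploy
from databusclient.cli import parse_distribution_str

def test_create_dataset_with_parsed_distribution_sketch():
    dist = parse_distribution_str("http://example.com/data.ttl|lang=en|.ttl|.gz")
    # Stub the (sha256, byte size) lookup so no file is downloaded.
    with patch("databusclient.api.deploy._load_file_stats", return_value=("0" * 64, 1234)):
        dataid = api_deploy.create_dataset(
            version_id="https://databus.example.org/user/group/artifact/2026.01.03",
            artifact_version_title="Example title",
            artifact_version_abstract="Example abstract",
            artifact_version_description="Example description",
            license_url="https://creativecommons.org/licenses/by/4.0/",
            distributions=[dist],
        )
    assert "@graph" in dataid  # generated JSON-LD is expected to carry @context and @graph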

209-275: LGTM! Comprehensive integration tests verify end-to-end functionality.

These tests effectively verify that:

  • Parsed distribution dictionaries are correctly consumed by create_dataset
  • The generated dataset structure includes proper @context and @graph
  • Distribution fields are correctly mapped, including content variants prefixed with dcv:
  • Multiple distributions with different language variants are properly represented

The approach of using generator expressions with next() (lines 234-237, 266-269) to locate specific graphs is clean and idiomatic.

@vaibhav45sktech
Contributor Author

@Integer-Ctrl Could you please review my PR?

@Integer-Ctrl Integer-Ctrl self-requested a review January 9, 2026 14:08
Contributor

@Integer-Ctrl Integer-Ctrl left a comment

Just some general feedback. I commented on #32 to gain more insight and clarity on the issue/goal.

Contributor

File seems useless? Is ONTOLOGIES_QUERY or parse_content_variants_string used at all?

Contributor Author

Thanks for the feedback. The usage is currently indirect, which I agree is not very clear. I’ll add clarification (or refactor) to make the purpose and usage of ONTOLOGIES_QUERY / parse_content_variants_string more explicit.

Contributor

I would prefer to keep cli.py as compact as possible and move logic (methods) always to the according CLI option (in this case, deploy.py).

Contributor Author

I would prefer to keep cli.py as compact as possible and move logic (methods) always to the according CLI option (in this case, deploy.py).

Sure @Integer-Ctrl, I will incorporate the suggested changes.

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@databusclient/api/deploy.py`:
- Around line 218-249: The function _get_file_info_from_dict currently calls
_load_file_stats(url) even when url is empty; add an explicit validation in
_get_file_info_from_dict to fail fast: if either sha256sum or content_length is
missing and the local variable url is falsy/empty, raise a clear exception
(e.g., ValueError) indicating the missing required "url" instead of calling
_load_file_stats; keep the existing fallback to _load_file_stats only when url
is present. Ensure references to url, sha256sum, content_length,
_get_file_info_from_dict and _load_file_stats are used so reviewers can locate
the change.

In `@databusclient/cli.py`:
- Around line 133-147: The create_dataset call in cli.py has a syntax error
(missing comma) and passes the wrong variable: remove the duplicate positional
args (first line with version_id, title, abstract, description, license_url,
parsed_distributions) or add the missing comma and replace the final
distributions argument with parsed_distributions so api_deploy.create_dataset
receives the structured list; update the call that references
api_deploy.create_dataset (and ensure you constructed parsed_distributions via
parse_distribution_str) to use keyword args version_id / artifact_version_title
/ artifact_version_abstract / artifact_version_description / license_url and
distributions=parsed_distributions.

Comment on lines +218 to +249
def _get_file_info_from_dict(dist_dict: Dict[str, any]) -> Tuple[Dict[str, str], str, str, str, int]:
    """
    Extract file info from a pre-parsed distribution dictionary.

    Parameters
    ----------
    dist_dict : dict
        A dictionary with keys: url, variants, formatExtension, compression
        (as returned by parse_distribution_str in cli.py)

    Returns
    -------
    Tuple containing:
    - cvs: Dict of content variants
    - format_extension: File format extension
    - compression: Compression type
    - sha256sum: SHA-256 hash of file
    - content_length: File size in bytes
    """
    url = dist_dict.get("url", "")
    cvs = dist_dict.get("variants", {})
    format_extension = dist_dict.get("formatExtension") or "file"
    compression = dist_dict.get("compression") or "none"

    # Check if sha256sum and content_length are provided
    sha256sum = dist_dict.get("sha256sum")
    content_length = dist_dict.get("byteSize")

    # If not provided, load from URL
    if sha256sum is None or content_length is None:
        sha256sum, content_length = _load_file_stats(url)

⚠️ Potential issue | 🟡 Minor

Validate required url before fallback download.

If a caller passes a dict without url, _load_file_stats("") raises a low-signal requests error. Fail fast with a clear exception before any network call.

🛠️ Proposed fix
-def _get_file_info_from_dict(dist_dict: Dict[str, any]) -> Tuple[Dict[str, str], str, str, str, int]:
+def _get_file_info_from_dict(dist_dict: Dict[str, any]) -> Tuple[Dict[str, str], str, str, str, int]:
     """
@@
-    url = dist_dict.get("url", "")
+    url = dist_dict.get("url")
+    if not url:
+        raise BadArgumentException("Distribution dict missing required 'url' field.")
🤖 Prompt for AI Agents
In `@databusclient/api/deploy.py` around lines 218 - 249, The function
_get_file_info_from_dict currently calls _load_file_stats(url) even when url is
empty; add an explicit validation in _get_file_info_from_dict to fail fast: if
either sha256sum or content_length is missing and the local variable url is
falsy/empty, raise a clear exception (e.g., ValueError) indicating the missing
required "url" instead of calling _load_file_stats; keep the existing fallback
to _load_file_stats only when url is present. Ensure references to url,
sha256sum, content_length, _get_file_info_from_dict and _load_file_stats are
used so reviewers can locate the change.

Comment on lines +133 to +147
        # --- CHANGE START ---
        # Parse the input strings into structured objects
        parsed_distributions = [parse_distribution_str(d) for d in distributions]

        # Note: api_deploy.create_dataset now accepts this list of dicts
        dataid = api_deploy.create_dataset(
            version_id, title, abstract, description, license_url, parsed_distributions
            version_id=version_id,
            artifact_version_title=title,
            artifact_version_abstract=abstract,
            artifact_version_description=description,
            license_url=license_url,
            distributions=distributions
        )
        # --- CHANGE END ---

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Check the actual file content at the specified lines
cat -n databusclient/cli.py | sed -n '130,150p'

Repository: dbpedia/databus-python-client

Length of output: 1099


🏁 Script executed:

# Also search for the create_dataset function signature to understand expected parameters
rg -n "def create_dataset" --type py -A 10

Repository: dbpedia/databus-python-client

Length of output: 764


🏁 Script executed:

# Find where api_deploy is defined/imported to understand the module
rg -n "api_deploy" databusclient/cli.py | head -20

Repository: dbpedia/databus-python-client

Length of output: 396


Fix syntax error in create_dataset call and use correct variable.

Line 139 is missing a comma before the keyword arguments on line 140, and you're passing distributions instead of parsed_distributions on line 145. This won't parse and breaks the CLI.

Fix
-        dataid = api_deploy.create_dataset(
-            version_id, title, abstract, description, license_url, parsed_distributions
-            version_id=version_id,
-            artifact_version_title=title,
-            artifact_version_abstract=abstract,
-            artifact_version_description=description,
-            license_url=license_url,
-            distributions=distributions
-        )
+        dataid = api_deploy.create_dataset(
+            version_id=version_id,
+            artifact_version_title=title,
+            artifact_version_abstract=abstract,
+            artifact_version_description=description,
+            license_url=license_url,
+            distributions=parsed_distributions,
+        )
🧰 Tools
🪛 GitHub Actions: Python CI (Lint & pytest)

[error] 140-140: Ruff check failed with SyntaxError: Expected ',', found name at databusclient/cli.py:140:13. Command: 'poetry run ruff check --output-format=github .'

🪛 GitHub Check: build

[failure] 140-140: Ruff
databusclient/cli.py:140:13: SyntaxError: Expected ',', found name

🤖 Prompt for AI Agents
In `@databusclient/cli.py` around lines 133 - 147, The create_dataset call in
cli.py has a syntax error (missing comma) and passes the wrong variable: remove
the duplicate positional args (first line with version_id, title, abstract,
description, license_url, parsed_distributions) or add the missing comma and
replace the final distributions argument with parsed_distributions so
api_deploy.create_dataset receives the structured list; update the call that
references api_deploy.create_dataset (and ensure you constructed
parsed_distributions via parse_distribution_str) to use keyword args version_id
/ artifact_version_title / artifact_version_abstract /
artifact_version_description / license_url and
distributions=parsed_distributions.
